
    Simulation of multi-core systems

    The computer systems market has grown significantly since the beginning of the cloud computing era, and this demand drives an increase in the complexity and efficiency of computer architectures. Simulation is one of the most important steps in the development of new architectures: it eliminates the need for real hardware during the initial development phases. In this work, we propose a gem5 simulator model of the ARM Neoverse N1. We calibrate the model's cache memories by running microbenchmarks on the model and comparing the results against the real hardware. The results show that our calibration method achieves cache access latencies close to those of the real hardware.
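
    The abstract does not reproduce the microbenchmark itself; as a point of reference, cache access latency is commonly calibrated with a pointer-chasing loop over randomly permuted memory, swept across working-set sizes. The sketch below illustrates that technique only; the buffer sizes, hop count, and output format are assumptions, not the authors' benchmark.

    ```c
    /* Minimal pointer-chasing latency microbenchmark (illustrative sketch).
     * Walking a single random cycle defeats hardware prefetching, so the
     * average time per dependent load approximates the access latency of
     * whichever cache level the working set fits into. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define HOPS 10000000UL   /* number of dependent loads per working set (assumed) */

    static double chase(size_t n_elems) {
        size_t *next = malloc(n_elems * sizeof *next);
        if (!next) return 0.0;
        for (size_t i = 0; i < n_elems; i++) next[i] = i;
        /* Sattolo's algorithm: produces a single cycle visiting every element. */
        for (size_t i = n_elems - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (unsigned long h = 0; h < HOPS; h++) p = next[p];   /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("working set %8zu KiB: %6.2f ns/load (p=%zu)\n",
               n_elems * sizeof *next / 1024, ns / HOPS, p);
        free(next);
        return ns / HOPS;
    }

    int main(void) {
        /* Sweep working sets from L1-sized up to beyond typical LLC sizes. */
        for (size_t kib = 16; kib <= 16384; kib *= 2)
            chase(kib * 1024 / sizeof(size_t));
        return 0;
    }
    ```

    Running the same sweep inside the simulator and on the silicon exposes the latency plateaus of each cache level, which is the signal such a calibration compares.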

    High-Performance Solvers for Dense Hermitian Eigenproblems

    We introduce a new collection of solvers - subsequently called EleMRRR - for large-scale dense Hermitian eigenproblems. EleMRRR solves various types of problems: generalized, standard, and tridiagonal eigenproblems. Among these, the last is of particular importance, as it is a solver in its own right as well as the computational kernel for the first two; we present a fast and scalable tridiagonal solver based on the Algorithm of Multiple Relatively Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers, PMRRR is part of the freely available Elemental library and is designed to fully support both message-passing (MPI) and multithreaded (SMP) parallelism. As a result, the solvers can be used equally well in pure MPI or in hybrid MPI-SMP fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's solvers on two supercomputers. This study, performed with up to 8,192 cores, provides precise guidelines for assembling the fastest solver within the ScaLAPACK framework; it also indicates that EleMRRR outperforms even the fastest solvers built from ScaLAPACK's components.
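
    PMRRR itself is a distributed-memory library; as a small serial point of reference for the tridiagonal kernel it parallelizes, the MRRR algorithm is available in LAPACK as dstemr. The sketch below calls it through the LAPACKE C interface; the 4x4 tridiagonal matrix is made up for illustration and has nothing to do with the paper's benchmarks.

    ```c
    /* Illustrative sketch: compute all eigenpairs of a small symmetric
     * tridiagonal matrix with LAPACK's MRRR routine dstemr (via LAPACKE). */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 4;
        double d[4] = {  2.0,  2.0,  2.0, 2.0 };   /* diagonal */
        double e[4] = { -1.0, -1.0, -1.0, 0.0 };   /* off-diagonal; last entry is workspace */
        double w[4], z[4 * 4];
        lapack_int m, isuppz[2 * 4];
        lapack_logical tryrac = 1;   /* let MRRR test for high relative accuracy */

        lapack_int info = LAPACKE_dstemr(LAPACK_COL_MAJOR, 'V', 'A', n, d, e,
                                         0.0, 0.0, 0, 0,   /* vl,vu,il,iu unused for range 'A' */
                                         &m, w, z, n, n, isuppz, &tryrac);
        if (info != 0) { fprintf(stderr, "dstemr failed: %d\n", (int)info); return 1; }
        for (lapack_int i = 0; i < m; i++) printf("lambda[%d] = %f\n", (int)i, w[i]);
        return 0;
    }
    ```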

    Optimal load balancing techniques for block-cyclic decompositions for matrix factorization

    In this paper, we present a new load balancing technique, called panel scattering, which is generally applicable to parallel block-partitioned dense linear algebra algorithms such as matrix factorization. Here, the panels formed in such computations are divided across their length and evenly (re-)distributed among all processors. It is shown how this technique can be efficiently implemented for the general block-cyclic matrix distribution, requiring only the collective communication primitives that are required for block-cyclic parallel BLAS. In most situations, panel scattering yields optimal load balance and cell computation speed across all stages of the computation. It also has the advantage of naturally yielding good memory access patterns. Compared with traditional methods, which minimize communication costs at the expense of load balance, it incurs a small (in some situations negative) increase in communication volume. It does, however, incur extra communication startup costs, though only by a factor not exceeding 2. To maximize load balance and minimize the cost of panel redistribution, storage block sizes should be kept small; furthermore, in many situations of interest, there will be no significant communication startup penalty for doing so. Results are given for the Fujitsu AP+ parallel computer, comparing the performance of panel scattering with previously established methods for LU, LLT and QR factorization. These results are consistent with a detailed performance model for LU factorization, developed here for each method.
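
    For readers unfamiliar with the distribution being discussed, the sketch below shows the standard 1-D block-cyclic index mapping that such block-partitioned factorizations assume. It is the usual textbook convention, not the paper's code; the block size and process count in the example are arbitrary.

    ```c
    /* Standard 1-D block-cyclic mapping: for a global index g, block size nb,
     * and p processes, determine the owning process and the local position. */
    #include <stdio.h>

    typedef struct { int owner; int local; } bc_map;

    static bc_map block_cyclic(int g, int nb, int p) {
        int gblock = g / nb;                   /* global block index              */
        bc_map m;
        m.owner = gblock % p;                  /* blocks are dealt out cyclically  */
        m.local = (gblock / p) * nb + g % nb;  /* position in the owner's storage  */
        return m;
    }

    int main(void) {
        /* Example: block size 3 on 2 processes: columns 0-2 go to P0,
         * 3-5 to P1, 6-8 back to P0, and so on. */
        for (int g = 0; g < 12; g++) {
            bc_map m = block_cyclic(g, 3, 2);
            printf("global %2d -> P%d local %d\n", g, m.owner, m.local);
        }
        return 0;
    }
    ```

    With this mapping, a whole panel of width nb lives on one process column, which is the load imbalance panel scattering removes by splitting each panel along its length; keeping nb small limits how much data each redistribution has to move.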

    ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers

    Solving the electronic structure from a generalized or standard eigenproblem is often the bottleneck in large-scale calculations based on Kohn-Sham density-functional theory. This problem must be addressed by essentially all current electronic structure codes, based on similar matrix expressions, and by high-performance computation. Here we present a unified software interface, ELSI, to access different strategies that address the Kohn-Sham eigenvalue problem. Currently supported algorithms include the dense generalized eigensolver library ELPA, the orbital minimization method implemented in libOMM, and the pole expansion and selected inversion (PEXSI) approach with lower computational complexity for semilocal density functionals. The ELSI interface aims to simplify the implementation and optimal use of the different strategies by offering (a) a unified software framework designed for the electronic structure solvers in Kohn-Sham density-functional theory; (b) reasonable default parameters for a chosen solver; (c) automatic conversion between input and internal working matrix formats; and, in the future, (d) recommendation of the optimal solver depending on the specific problem. Comparative benchmarks are shown for system sizes up to 11,520 atoms (172,800 basis functions) on distributed-memory supercomputing architectures.
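
    The problem all of these solvers target is the generalized Kohn-Sham eigenproblem H C = S C E, with a Hamiltonian H and an overlap matrix S. As a minimal serial stand-in (this is not the ELSI interface, and the 2x2 matrices are invented for illustration), LAPACK's dsygv solves the dense real symmetric-definite case directly:

    ```c
    /* Minimal serial stand-in for the generalized eigenproblem H C = S C E
     * (real symmetric H, symmetric positive-definite overlap S). */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 2;
        /* Column-major storage; only the upper triangles are referenced. */
        double H[4] = { -1.0, 0.2,
                         0.2, -0.5 };
        double S[4] = {  1.0, 0.1,
                         0.1,  1.0 };
        double eval[2];

        /* itype=1 selects A*x = lambda*B*x; jobz='V' returns eigenvectors in H. */
        lapack_int info = LAPACKE_dsygv(LAPACK_COL_MAJOR, 1, 'V', 'U', n,
                                        H, n, S, n, eval);
        if (info != 0) { fprintf(stderr, "dsygv failed: %d\n", (int)info); return 1; }
        printf("eigenvalues: %f %f\n", eval[0], eval[1]);
        return 0;
    }
    ```

    Libraries such as ELPA, libOMM, and PEXSI exist precisely because a serial dense solve like this does not scale to the matrix sizes quoted in the benchmarks; ELSI's role is to hide the differences between those back ends behind one interface.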

    Studies in Rheology: Molecular Simulation and Theory

    With the enormous advance in the capability of computers during the last few decades, computer simulation has become an important tool for scientific research in many areas such as physics, chemistry, and biology. In particular, molecular dynamics (MD) simulations have proven to be of great help in understanding the rheology of complex fluids from the fundamental microscopic viewpoint. There are two important standard flows in rheology: shear flow and elongational flow. While suitable nonequilibrium MD (NEMD) algorithms exist for shear flow, such as the Lees-Edwards purely boundary-driven algorithm and the field-driven SLLOD algorithm, a proper NEMD algorithm for elongational flow has been lacking. The main difficulty of simulating elongational flow lies in the limited simulation time available, due to the contraction of one or two dimensions dictated by its kinematics. This problem, however, has been partially resolved by Kraynik and Reinelt's ingenious discovery of the temporal and spatial periodicity of lattice vectors in planar elongational flow (PEF). Although there have been a few NEMD simulations of PEF using their idea, another serious defect has recently been reported when using the SLLOD algorithm in PEF: for adiabatic systems, the total linear momentum of the system in the contracting direction grows exponentially with time, which eventually leads to an aphysical phase transition. This problem has been completely resolved by using the so-called 'proper-SLLOD' or 'p-SLLOD' algorithm, whose development has been one of the main accomplishments of this study. The fundamental correctness of the p-SLLOD algorithm has been demonstrated quite thoroughly in this work through detailed theoretical analyses together with direct simulation results. Both the theoretical and simulation work achieved in this research are expected to play a significant role in advancing the knowledge of rheology, as well as that of NEMD simulation itself for other types of flow in general. Another important achievement of this work is the demonstration of the possibility of predicting liquid structure in nonequilibrium states by employing a concept of 'hypothetical' nonequilibrium potentials. The methodology developed in this work has been shown to have good potential for further developments in this field.
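
    For orientation, the equations below give the schematic forms of the SLLOD and p-SLLOD equations of motion as they are commonly written in the NEMD literature; the notation (r_i, p_i, F_i, and the imposed velocity gradient tensor) is assumed here and is not quoted from the thesis itself.

    ```latex
    % Planar elongational flow imposes the velocity gradient
    %   \nabla\mathbf{u} = \dot{\varepsilon}\,(\hat{\mathbf{x}}\hat{\mathbf{x}} - \hat{\mathbf{y}}\hat{\mathbf{y}}),
    % so one direction contracts while the other expands.
    \begin{align}
      \text{SLLOD:}\quad
        \dot{\mathbf{r}}_i &= \frac{\mathbf{p}_i}{m_i} + \mathbf{r}_i\cdot\nabla\mathbf{u}, &
        \dot{\mathbf{p}}_i &= \mathbf{F}_i - \mathbf{p}_i\cdot\nabla\mathbf{u} \\
      \text{p-SLLOD:}\quad
        \dot{\mathbf{r}}_i &= \frac{\mathbf{p}_i}{m_i} + \mathbf{r}_i\cdot\nabla\mathbf{u}, &
        \dot{\mathbf{p}}_i &= \mathbf{F}_i - \mathbf{p}_i\cdot\nabla\mathbf{u}
          - m_i\,\mathbf{r}_i\cdot\nabla\mathbf{u}\cdot\nabla\mathbf{u}
    \end{align}
    ```

    For planar shear the extra quadratic term vanishes, so the two algorithms coincide there; the distinction matters precisely in elongational flow, where the momentum growth described above appears.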

    Accelerated methods for performing the LDLT decomposition

    This paper describes the design, implementation and performance of parallel direct dense symmetric-indefinite matrix factorisation algorithms. These algorithms use the Bunch-Kaufman diagonal pivoting method. The starting point is numerically identical to the LAPACK _sytrf() algorithm, but outperforms zsytrf() by approximately 15% for large matrices on the UltraSPARC family of processors. The first variant reduces symmetric interchanges, which are particularly important for parallel implementation, by taking into account the growth attained by any preceding columns that did not require interchanges; it nevertheless achieves the same growth bound. The second variant uses a lookahead technique with heuristic methods to predict whether interchanges are required over the next block column; if so, the block column can be eliminated using modified Cholesky methods, which can yield both computational and communication advantages. These algorithms yield the best performance gains on `weakly indefinite' matrices (i.e. those which generally have large diagonal elements), which often arise from electromagnetic field analysis applications. On UltraSPARC processors, the first variant generally achieves a 1-2% performance gain; the second is faster still for large matrices, by 2% for complex double precision and 6% for double precision. However, larger performance gains are observed on distributed-memory machines, where symmetric interchanges are relatively more expensive. On a 16-node 300 MHz UltraSPARC-based Fujitsu AP3000, the first variant achieved a 10-15% improvement for small to moderately sized matrices, decreasing to 7% for large matrices. For N=10000, it achieved a sustained speed of 5.6 GFLOPs and a parallel speedup of 12.8.
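
    The baseline these variants build on is LAPACK's sytrf family, which performs the Bunch-Kaufman LDL^T factorisation with 1x1 and 2x2 pivots. The sketch below calls the real double-precision routine through LAPACKE; the 3x3 indefinite matrix is made up for illustration and is unrelated to the paper's test problems.

    ```c
    /* Bunch-Kaufman LDL^T factorisation of a small symmetric indefinite
     * matrix via LAPACK's dsytrf (LAPACKE interface). */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        const lapack_int n = 3;
        /* Symmetric indefinite matrix, column-major; upper triangle referenced. */
        double A[9] = {  4.0,  1.0, -2.0,
                         1.0, -3.0,  0.5,
                        -2.0,  0.5,  1.0 };
        lapack_int ipiv[3];

        /* Factor A = U*D*U^T; ipiv records the 1x1 / 2x2 pivot structure and
         * any symmetric interchanges that were performed. */
        lapack_int info = LAPACKE_dsytrf(LAPACK_COL_MAJOR, 'U', n, A, n, ipiv);
        if (info != 0) { fprintf(stderr, "dsytrf failed: %d\n", (int)info); return 1; }
        for (lapack_int i = 0; i < n; i++)
            printf("ipiv[%d] = %d\n", (int)i, (int)ipiv[i]);
        return 0;
    }
    ```

    The interchanges recorded in ipiv are exactly the operations that become expensive in a distributed setting, which is why the paper's variants try to avoid or predict them.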

    Algorithmic Cholesky factorization fault recovery

    Modeling and analysis of large-scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this will often be performed on high-performance clusters containing many processors. Assuming a constant failure rate per processor, the probability of a failure occurring during the execution increases linearly with additional processors. Fault tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault tolerant Cholesky factorization algorithm that does not require checkpointing for recovery from fail-stop failures. Rather, this algorithm uses redundant data added in an additional set of processors. This differs from previous work with algorithmic methods, as it addresses fail-stop failures rather than fail-continue cases. The implementation and experimentation using ScaLAPACK demonstrate that this method has decreasing overhead relative to overall runtime as the matrix size increases, and thus shows promise for reducing the expected runtime of Cholesky factorizations on very large matrices.
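
    The redundant data in such algorithm-based schemes is typically a checksum block held on the extra processors. The sketch below shows only the underlying recovery idea on a tiny serial example (it is not the paper's ScaLAPACK implementation, and the matrix values are invented): an extra row of column sums lets a lost row be rebuilt from the surviving rows.

    ```c
    /* Checksum-based recovery idea behind algorithmic fault tolerance:
     * an extra row holds the column sums of A, so a single lost row can be
     * reconstructed as (checksum - sum of surviving rows). */
    #include <stdio.h>
    #include <string.h>

    #define N 4

    int main(void) {
        double A[N + 1][N] = {
            { 4.0, 1.0, 0.5, 0.0 },
            { 1.0, 3.0, 0.0, 0.5 },
            { 0.5, 0.0, 2.0, 1.0 },
            { 0.0, 0.5, 1.0, 5.0 },
            { 0 }                      /* row N will hold the checksum */
        };

        /* Build the checksum row: column sums of the data rows. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                A[N][j] += A[i][j];

        /* Simulate a fail-stop loss of row 2, then recover it. */
        int lost = 2;
        memset(A[lost], 0, sizeof A[lost]);
        for (int j = 0; j < N; j++) {
            double recovered = A[N][j];
            for (int i = 0; i < N; i++)
                if (i != lost) recovered -= A[i][j];
            A[lost][j] = recovered;
        }
        printf("recovered row %d: %.1f %.1f %.1f %.1f\n",
               lost, A[lost][0], A[lost][1], A[lost][2], A[lost][3]);
        return 0;
    }
    ```

    In the fault tolerant factorization, checksums of this kind are carried through the update steps so that recovery remains possible mid-factorization without writing checkpoints.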